1 Titanic challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

1.1 Executive Summary


Figure 1:


# Load libraries
library(tidyverse)    # Data manipulation
library(mlr)          # Machine learning framework
library(rpart)        # Decision tree
library(rpart.plot)   # Plot decision tree
library(randomForest) # Random forest algorithm
library(ggplot2)      # Visualization
library(plotly)       # Interactive visualizations
library(kableExtra)   # Table formatting
set.seed(21)          # Specify seed (useful for replication)

2 Data

2.1 Data Extraction



Let’s load the Titanic data using the function read.csv(). The Titanic dataset is stored as a CSV file in the subfolder titanic. Click the Code button to see the R code.

# Load datasets
titanic.dataset <- read.csv('titanic/train.csv')

2.2 Data Exploration



There are 11 attributes (the last one, Survived, is the outcome we want to predict):

  • Name: Passenger’s name,
  • Sex: Passenger’s sex,
  • Age: Passenger’s age,
  • SibSp: Number of siblings/spouses aboard,
  • Parch: Number of parents/children aboard,
  • Pclass: Passenger’s class,
  • Ticket: Ticket number,
  • Fare: Fare,
  • Cabin: Cabin,
  • Embarked: Port of embarkation,
  • Survived: Survived (1) or died (0).

Pclass: a proxy for socio-economic status (SES): 1st = upper, 2nd = middle, 3rd = lower.

Age: age is fractional if less than 1; if the age is estimated, it is in the form xx.5.

SibSp: the dataset defines family relations in this way: sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored).

Parch: the dataset defines family relations in this way: parent = mother, father; child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore Parch = 0 for them.
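Before exploring further, it helps to know where the gaps in the data are. Below is a minimal base-R sketch of counting missing values per column, shown on an invented toy data frame (with the real data you would pass titanic.dataset instead of df):

```r
# Count missing values per column; df is a toy stand-in for titanic.dataset
df <- data.frame(Age  = c(22, NA, 26, NA),
                 Fare = c(7.25, 71.28, 7.93, 8.05))
missing.per.column <- colSums(is.na(df))
missing.per.column # Age: 2, Fare: 0
```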

options(knitr.kable.NA = '')
titanic.dataset %>%
  select(-c(PassengerId, Name, Ticket)) %>% # Drop identifier columns from the summary
  summary() %>%                             # Quick description of the data
  knitr::kable(digits = 2) %>%              # Create an HTML table
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% # Style
  row_spec(0, bold = T, color = "white", background = "#4056A1")
| Statistic | Survived | Pclass | Age   | SibSp | Parch  | Fare   |
|-----------|----------|--------|-------|-------|--------|--------|
| Min.      | 0.0000   | 1.000  | 0.42  | 0.000 | 0.0000 | 0.00   |
| 1st Qu.   | 0.0000   | 2.000  | 20.12 | 0.000 | 0.0000 | 7.91   |
| Median    | 0.0000   | 3.000  | 28.00 | 0.000 | 0.0000 | 14.45  |
| Mean      | 0.3838   | 2.309  | 29.70 | 0.523 | 0.3816 | 32.20  |
| 3rd Qu.   | 1.0000   | 3.000  | 38.00 | 1.000 | 0.0000 | 31.00  |
| Max.      | 1.0000   | 3.000  | 80.00 | 8.000 | 6.0000 | 512.33 |
| NA’s      |          |        | 177   |       |        |        |

For the factor columns, summary() reports level counts: Sex (female: 314, male: 577), Cabin (empty: 687, B96 B98: 4, C23 C25 C27: 4, G6: 4, C22 C26: 3, D: 3, Other: 186), and Embarked (empty: 2, C: 168, Q: 77, S: 644).

We format the tables using the packages knitr and kableExtra.

Let’s have a look at the top 3 rows using the function head().

titanic.dataset%>%
  select(-c(PassengerId,Ticket))%>%
  head(3)%>%
  kable()%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>% 
  row_spec(0, bold = T, color = "white", background = "#4056A1")
| Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | 7.2500 | | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | 71.2833 | C85 | C |
| 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | 7.9250 | | S |



2.3 Visualizations



How you visualize a variable depends on whether the variable is categorical or continuous. What makes a chart effective is the depth of the critical analysis it displays. In other words, do you have information worth making a chart for, and have you portrayed it accurately? There are four steps to create an effective chart [1]:

  • Research: gather the data and verify your sources.
  • Edit: identify the key message and make numerical adjustments that enhance your point.
  • Plot: choose the right plot and settings (scale, baseline) to present your findings.
  • Review: check your plot against your sources.
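As an example of the Edit step, one numerical adjustment that sharpens the message is collapsing raw survival flags into a survival rate per class before plotting. A base-R sketch on invented toy data (with the real data, replace df by titanic.dataset):

```r
# Survival rate by passenger class (toy data standing in for titanic.dataset)
df <- data.frame(Pclass   = c(1, 1, 2, 2, 3, 3, 3),
                 Survived = c(1, 1, 1, 0, 0, 0, 1))
rate.by.class <- aggregate(Survived ~ Pclass, data = df, FUN = mean)
rate.by.class # One survival rate per class, ready for a bar chart (e.g. geom_col)
```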


Let’s look at the age distribution of the passengers who survived and those who didn’t. The plot below shows that the survival chances of children were relatively high.

col1 <- c('0' = '#E8A87C', '1' = '#85CDCA')

g <- titanic.dataset %>%
  ggplot() +
  geom_density(aes(x = Age, fill = factor(Survived)), alpha = 0.7) +
  scale_fill_manual('Survived', values = col1) +
  theme_classic() +
  ylab('Density')
ggplotly(g) # Convert the ggplot to an interactive plot with plotly

The interactive plots are created with the library plotly. The advantages of combining ggplot and plotly are twofold. First, you create a ggplot, which can be saved as png, pdf, etc., so the plots can be reused in PowerPoint presentations or Word documents; to save the ggplot as a pdf, replace the command ggplotly(g) with ggsave('plot.pdf', g). Second, turning a ggplot into an interactive plot requires a single function, ggplotly(). If you would like to learn more about plotly, visit the following page.



2.4 Data Cleaning



Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data-wrangling tools, or as batch processing through scripting.

2.5 Techniques for Filling Missing Values



Several machine learning algorithms do not work with missing values. These are some techniques to handle missing values:

  • Do nothing.
  • Remove observations with missing values (rows and/or columns).
  • Imputation using mean/median values.
      • Advantage: easy and fast.
      • Disadvantage: it does not consider the correlations between features.
  • Imputation using the most frequent value, or a zero/constant value.
  • Imputation using deep learning.
  • Imputation using k-NN.

Our choice is imputation using the mean.

# Calculate Average age
m.Age<-mean(titanic.dataset$Age,na.rm=T) 

titanic.dataset<-titanic.dataset%>%
  mutate(Age=replace_na(Age,m.Age)) 

If you would like to know more about using k-NN for data imputation, here is a reference: The use of KNN for missing values.
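To give an idea of what k-NN imputation does, here is a minimal base-R sketch on invented toy data: a missing Age is filled with the mean Age of the k rows whose Fare is closest. Dedicated packages implement the full method over all features; this is only an illustration.

```r
# k-NN imputation sketch (k = 2) using Fare as the only distance feature
df <- data.frame(Age  = c(22, 38, 26, NA),
                 Fare = c(7.25, 71.28, 7.93, 8.05))
k <- 2
miss <- which(is.na(df$Age))
for (i in miss) {
  d <- abs(df$Fare - df$Fare[i]) # Distance on the observed feature
  d[miss] <- Inf                 # Never borrow from rows that are themselves missing
  nn <- order(d)[1:k]            # Indices of the k nearest neighbours
  df$Age[i] <- mean(df$Age[nn])  # Impute with their mean Age
}
df$Age # 22, 38, 26, 24 (mean of the two nearest neighbours, 26 and 22)
```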

2.6 Split data into train and test data



Figure 2:

# ##################################
# Create train and test data

titanic.dataset$id <- 1:nrow(titanic.dataset)

# Randomize data: create train and test datasets
train.sample <- sample(x = 1:nrow(titanic.dataset),
                       size = floor(nrow(titanic.dataset) * 0.75), # Train data is 75% of the dataset
                       replace = FALSE)

titanic.train <- titanic.dataset %>%
  filter(id %in% train.sample)
titanic.test <- titanic.dataset %>%
  filter(!(id %in% train.sample))
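The same split logic in miniature, to check the bookkeeping: with 8 rows and a 75% train share, floor(8 * 0.75) = 6 rows go to training and the remaining 2 to testing, with no row in both sets.

```r
# 75/25 split of a toy data frame, mirroring the code above
set.seed(21)
df <- data.frame(id = 1:8)
train.idx <- sample(1:nrow(df), size = floor(nrow(df) * 0.75), replace = FALSE)
df.train <- df[train.idx, , drop = FALSE]
df.test  <- df[-train.idx, , drop = FALSE]
c(train = nrow(df.train), test = nrow(df.test)) # train: 6, test: 2
```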

2.7 Feature Selection



# Variables used in the model
columns <- c('Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Survived')
titanic.train <- titanic.train[, columns] # Subset the train split (not the full dataset)
titanic.test  <- titanic.test[, columns]

3 Model



3.1 Machine Learning Algorithms



Figure 3:

3.2 Decision Tree



Decision trees partition the feature space into a set of rectangles, and then fit a simple model in each one (e.g. a constant). A decision tree is a tree whose inner nodes represent features. Each edge stands for a feature value, and each leaf node is assigned a class value (i.e. a constant). Decision trees are fairly intuitive and their predictions are easy to interpret.
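A minimal illustration of this with rpart on invented toy data (the data and variable names are made up for the example): with one informative feature, the tree splits once on Sex and each leaf carries a constant class.

```r
library(rpart)

# Tiny decision tree: predict survival from sex on invented toy data
df <- data.frame(Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 0)),
                 Sex      = c('f', 'f', 'f', 'f', 'm', 'm', 'm', 'm'))
fit <- rpart(Survived ~ Sex, data = df, method = 'class',
             control = rpart.control(minsplit = 2, cp = 0.01))
predict(fit, data.frame(Sex = c('f', 'm')), type = 'class')
```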

Figure:




3.3 Random Forest



A random forest is an ensemble of decision trees trained via the bootstrap aggregating method (bagging). With a random forest, we obtain a diverse set of classifiers by training decision trees on different random subsets of the training dataset. When the sampling is performed with replacement, the method is called bagging; when it is performed without replacement, it is called pasting. Once all predictors are trained, the ensemble makes a prediction for a new data point by aggregating the predictions of all predictors. Generally, a random forest has a lower variance than a single decision tree trained on the original training dataset.

We decided to use the random forest since it is not too slow compared to AdaBoost and neural networks, and it performs well (good accuracy, precision, and sensitivity).
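The bootstrap step can be seen directly: each tree gets its own sample of row ids drawn with replacement. This is only a sketch of the sampling idea, not the internals of randomForest.

```r
# Three bootstrap samples of six training ids, one per (hypothetical) tree
set.seed(21)
ids <- 1:6
bootstrap.samples <- replicate(3, sample(ids, size = length(ids), replace = TRUE))
bootstrap.samples # 6 x 3 matrix; ids can repeat within a column
```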

Figure 4:


# Model: Random Forest
chosen.learner<-'classif.randomForest'

3.4 Tune Model



Almost all machine learning algorithms have hyperparameters, settings that control the algorithm’s behaviour. The values of the hyperparameters are not adapted by the training algorithm itself. The hyperparameters cannot simply be learned from the training set, since that would always select the most complex models and would result in overfitting. To solve this problem, we need a validation set of examples that the training algorithm does not observe.

# ##################################
# Hyperparameters
discrete_ps = makeParamSet(
  makeDiscreteParam("ntree", values = c(1, 5)), # Number of trees
  makeDiscreteParam("mtry",  values = c(1, 2))  # Variables sampled at each split
)

3.4.1 Cross validation (K-fold)



Cross-validation is a widely used method for estimating prediction error. Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. K-fold cross-validation uses part of the training data to fit the model and a different part to test it. We split the data into K equal-sized parts, train the model on K-1 parts, and test it on the k-th part. We do this for \(k\in \{1,...,K\}\) and combine the K estimates of prediction error.
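The fold bookkeeping described above can be sketched in a few lines of base R (an illustration only; mlr's makeResampleDesc below handles this internally):

```r
# Assign 20 rows to K = 5 folds; each fold is held out exactly once
set.seed(21)
n <- 20; K <- 5
fold <- sample(rep(1:K, length.out = n)) # Balanced random fold labels
for (k in 1:K) {
  test.rows  <- which(fold == k) # Held-out part in round k
  train.rows <- which(fold != k) # Model is fit on the remaining K-1 parts
  # ... fit on train.rows, compute the error on test.rows, then average ...
}
table(fold) # Each of the 5 folds contains 20/5 = 4 rows
```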

Figure 6:


# ##################################
# Cross validation: 10-fold
rdesc <- makeResampleDesc(method = "CV",
                          iters = 10, # K = 10 folds
                          predict = 'test')

ctrl <- makeTuneControlGrid()

An alternative to 10-fold cross-validation is leave-one-out. However, since the dataset is not small, leave-one-out is computationally expensive.

# Define the classification task (the target must be a factor)
titanic.train$Survived <- factor(titanic.train$Survived)
classifi.model <- makeClassifTask(data = titanic.train, target = 'Survived')

chosen.model <- 'classif.randomForest'
chosen.learner <- makeLearner(chosen.model,
                              predict.type = "prob",
                              par.vals = list())

3.4.2 Optimal Model

optimal.rf <- tuneParams(chosen.learner,
                         task = classifi.model,
                         resampling = rdesc,
                         par.set = discrete_ps,
                         control = ctrl,
                         measures = list(acc, mmce, timetrain, f1, logloss)) # Measures to be calculated

cv.output = generateHyperParsEffectData(optimal.rf)

3.4.2.1 Train Model

# ########################################
# Train the model with the optimal hyperparameters
# ########################################
chosen.learner <- makeLearner(chosen.model,
                              predict.type = "prob",
                              par.vals = list(ntree = optimal.rf$x$ntree,
                                              mtry = optimal.rf$x$mtry))

train.model <- mlr::train(learner = chosen.learner,
                          task = classifi.model)

3.5 Model Performance



| R name | Name | Description |
|---|---|---|
| acc | Accuracy | The number of correct predictions divided by the total number of predictions. |
| mmce | Mean misclassification error | 1 - accuracy. |
| ppv | Precision / positive predictive value | Precision helps when the cost of false positives is high. |
| tpr | Recall / sensitivity / true positive rate | Recall helps when the cost of false negatives is high. |
| f1 | F1 measure | Harmonic mean of precision and recall; the harmonic mean is appropriate when an average of rates is desired. |
| gpr | Geometric mean of precision and recall | A geometric mean is often used to compare items whose properties have different numeric ranges. |
| | ROC curve | The true positive rate (sensitivity) plotted as a function of the false positive rate (1 - specificity) for different cut-off points. |
| auc | Area under the curve | A measure of how well a parameter can distinguish between two diagnostic groups. |
| logloss | Logarithmic loss | A perfect model would have a log loss of 0; log loss increases as the predicted probability diverges from the actual label. |
| timetrain | Time to train | Time needed to train the model. |
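To make the first two definitions concrete, here they are computed by hand on invented predictions (the variable names are made up for the example):

```r
# Accuracy and mean misclassification error on invented predictions
truth <- c(1, 0, 1, 1, 0)
pred  <- c(1, 0, 0, 1, 0)
acc.value  <- mean(pred == truth) # Correct predictions / all predictions
mmce.value <- mean(pred != truth) # Equals 1 - accuracy
c(acc = acc.value, mmce = mmce.value) # 0.8 and 0.2
```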
# ##################################
# Predictions (with optimal model)
pred.classif<-predict(train.model,
                      newdata =titanic.test)
# Performance Measures
chosen.measures<-list(acc,mmce,ppv,
                      tpr,f1,gpr,
                      auc,logloss,timetrain)

best.model.per<-performance(pred.classif, measures =chosen.measures,
            model=train.model)

Performance Measures

3.6 ROC Curve

The ROC (receiver operating characteristic) curve is a common plot used for binary classifiers. In the ROC curve, the sensitivity is plotted against 1-specificity.

At point \((0,0)\) the random forest predicts that no passenger survives, while at the other extreme \((1,1)\) it predicts that all passengers survive. The ideal machine learning algorithm would have a curve consisting only of the point \((0,1)\), i.e. \(100\%\) specificity and \(100\%\) sensitivity.
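Each point of the ROC curve is a (FPR, TPR) pair at one probability threshold. Computed by hand on invented predictions:

```r
# One ROC point: sensitivity and 1 - specificity at threshold 0.5
truth <- c(1, 1, 0, 0, 1, 0)
prob  <- c(0.9, 0.4, 0.2, 0.7, 0.8, 0.1) # Invented survival probabilities
pred  <- as.integer(prob >= 0.5)
tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1) # Sensitivity
fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0) # 1 - specificity
c(TPR = tpr, FPR = fpr) # 2/3 and 1/3
```

Sweeping the threshold from 1 down to 0 traces the curve from \((0,0)\) to \((1,1)\).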

col <- c('Random Forest' = '#C38D9E')

# Threshold-vs-performance data used for the ROC curve
perf.data <- generateThreshVsPerfData(pred.classif, measures = list(fpr, tpr))

roc.plot <- perf.data$data %>%
  ggplot() +
  geom_line(aes(x = fpr, y = tpr, col = 'Random Forest'), cex = 1.2) +
  xlab('False positive rate') + ylab('True positive rate') +
  ggtitle('ROC curve') + theme_classic() +
  scale_color_manual(values = col, '') +
  geom_abline(intercept = 0, slope = 1)

ggplotly(roc.plot)

ROC curve

3.7 Simulations

4 References


  1. If you would like to learn the dos and don’ts of presenting figures, I recommend the book The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures by Dona M. Wong.